Welcome to the Data Analysis II task. In this task we are going to work on how to handle missing data, as well as data scaling and normalization. Data can have missing values for a number of reasons, such as observations that were not collected or data corruption. We will discuss some general considerations for missing data, look at how pandas chooses to represent it, and demonstrate some built-in pandas tools for handling missing data in Python.
This should always be the first step of any analysis. Depending on the data source, missing data are identified differently. Pandas identifies missing values as NaN. However, unless the data has already been pre-processed, an analyst will not necessarily encounter missing values as NaN. Missing values can appear as a question mark (?), a zero (0), minus one (-1) or a blank. So it is important to view and understand your data before anything else.
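As a quick, standalone illustration (a minimal sketch; the placeholder values and column names here are made up, not taken from balance.txt), you can either tell pandas which placeholders to treat as missing while reading a file, or convert them to NaN afterwards:
import pandas as pd
import numpy as np
# option 1: treat '?' and -1 as missing while reading a file (the path here is hypothetical)
# df = pd.read_csv("some_file.csv", na_values=["?", -1])
# option 2: convert placeholders to NaN after the data has been loaded
raw = pd.DataFrame({"Age": [25, -1, 40], "City": ["Nairobi", "?", "Kigali"]})
cleaned = raw.replace({"?": np.nan, -1: np.nan})
cleaned.isnull().sum()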
We are going to use the dataset 'balance.txt'. From the first five observations, we can see there are several missing values, represented as NaN.
import pandas as pd
import numpy as np
# for Box-Cox Transformation
from scipy import stats
# for min_max scaling
from mlxtend.preprocessing import minmax_scaling
# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(0)
df = pd.read_table("balance.txt", delim_whitespace=True)
df.head()
How many NaN values can you point out?
Now that we have seen that there are missing values, we need to get a sense of how many missing values exist in the whole dataset. We are going to find how many missing values there are in each column.
# get the number of missing data points per column
missing_values_count = df.isnull().sum()
# look at the # of missing points in the first ten columns
missing_values_count[0:10]
To get an even better sense, we need to see what percentage of all the values are missing. If the percentage is very high, we may not be able to use the data.
# how many total missing values do we have?
total_cells = np.prod(df.shape)
total_missing = missing_values_count.sum()
# percent of data that is missing
(total_missing/total_cells) * 100
It seems like only about 1.77% of the values are missing.
One option is to remove any rows or columns that contain missing values. This strategy is not generally recommended for datasets with important/sensitive data or with a small number of observations. It is usually worth taking the time to go through your data and look closely at all the columns with missing values so you really understand your dataset.
If you're sure you want to drop rows with missing values, pandas has a handy function, dropna(), to help you do this. Let's try it out on our data.
# Create a temporary copy of the dataset
temp_df = df.copy()
# remove all the rows that contain a missing value.
temp_df.dropna().head()
As you can observe, observations 0, 1, 2 and 3 have been dropped from the dataframe because they contained NaN values.
The above procedure removed observations. Now let us look into removing columns.
# Create a temporary copy of the dataset
temp_df = df.copy()
# remove all columns with at least one missing value
columns_with_na_dropped = temp_df.dropna(axis=1)
columns_with_na_dropped.head()
It looks like all columns had at least one missing value except the Balance column. This is one of the reasons why dropping data for having NaNs may not be a very good solution. We'll explore other options.
We can use the pandas fillna() function to fill in missing values in a dataframe. One option we have is to specify what we want the NaN values to be replaced with. Here, we replace all NaNs with 0.
# Create a temporary copy of the dataset
temp_df = df.copy()
# replace all NA's with 0
temp_df.fillna(0).head()
Another way is to replace each missing value with the value that comes just before or just after it in the same column. This can be useful for datasets where the observations are in some sorted or logical order.
# Create a temporary copy of the dataset
temp_df = df.copy()
# replace each NA with the value that comes directly after it in the same column,
# then replace any remaining NAs with 0
temp_df.fillna(method = 'bfill', axis=0).fillna(0).head()
Using measures of central tendency involves substituting the missing values with the mean or median for numerical variables and the mode for categorical variables. This imputation technique works well when the values are missing completely at random. One disadvantage is that mean imputation reduces the variance in the dataset. A small sketch of this approach is shown below.
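As a minimal sketch of this idea on our dataframe (mean for numerical columns, mode for categorical ones; the median could be swapped in just as easily):
# Create a temporary copy of the dataset
temp_df = df.copy()
# fill each numerical column with that column's mean
numeric_cols = temp_df.select_dtypes(include=[np.number]).columns
temp_df[numeric_cols] = temp_df[numeric_cols].fillna(temp_df[numeric_cols].mean())
# fill each categorical column with that column's mode (most frequent value)
categorical_cols = temp_df.select_dtypes(exclude=[np.number]).columns
for col in categorical_cols:
    temp_df[col] = temp_df[col].fillna(temp_df[col].mode()[0])
temp_df.head()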
This is an imputation technique that uses other variables in the observed data to replace missing values with predicted values from a regression model. Complete observations are used to generate the regression equation; the equation is then used to predict missing values for incomplete observations. In an iterative process, values for the missing variable are inserted and then all cases are used to predict the dependent variable. These steps are repeated until there is little difference between the predicted values from one step to the next, that is, until they converge. The major drawback of this method is that it reduces variability. Though we have not introduced regression yet, it's important to keep this in mind.
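Though we won't cover regression here, a minimal sketch of this iterative, regression-based approach is possible with scikit-learn's IterativeImputer. This assumes scikit-learn is installed (it isn't imported elsewhere in this task) and only uses the numerical columns:
# iterative (regression-based) imputation on the numerical columns only
from sklearn.experimental import enable_iterative_imputer  # required to unlock IterativeImputer
from sklearn.impute import IterativeImputer
numeric_df = df.select_dtypes(include=[np.number])
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed_df = pd.DataFrame(imputer.fit_transform(numeric_df), columns=numeric_df.columns)
imputed_df.head()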
For k-nearest neighbour (kNN) imputation, the values are obtained using similarity-based methods that rely on distance metrics (Euclidean distance, Jaccard similarity, Minkowski norm, etc.). They can be used to predict both discrete and continuous attributes. kNN works by finding other observations that are most similar to the observation with the missing value. For example, if the observation is 'Female' and 'Asian', we find other similar observations and take the mean or mode of the missing variable from them. The main disadvantage of kNN imputation is that it becomes time-consuming on large datasets, because it searches through the whole dataset for similar instances.
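A minimal sketch using scikit-learn's KNNImputer (again assuming scikit-learn is installed; it works on numerical columns only, so categorical variables would have to be encoded first):
from sklearn.impute import KNNImputer
# impute each missing value from the 5 most similar observations (nan-aware Euclidean distance)
numeric_df = df.select_dtypes(include=[np.number])
knn_imputer = KNNImputer(n_neighbors=5)
knn_imputed_df = pd.DataFrame(knn_imputer.fit_transform(numeric_df), columns=numeric_df.columns)
knn_imputed_df.head()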
One of the reasons that it's easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that, in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data. Let's talk a little more in-depth about each of these options.
Scaling means transforming your data so that it fits within a specific range, like 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, like support vector machines (SVM) or k-nearest neighbours (KNN). With these algorithms, a change of "1" in any numeric feature is given the same importance.
For example, you might be looking at the prices of some products in both yen and US dollars. One US dollar is worth about 100 yen, but if you don't scale your prices, methods like SVM or KNN will consider a difference of 1 yen as important as a difference of 1 US dollar! This clearly doesn't fit our intuitions about the world. With currency, you can convert between units. But what if you're looking at something like height and weight? It's not entirely clear how many pounds should equal one inch (or how many kilograms should equal one metre).
By scaling your variables, you can help compare different variables on equal footing. To help solidify what scaling looks like, let's look at a made-up example. (Don't worry, we'll work with real data in just a second, this is just to help illustrate the point.)
# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size = 1000)
# mix-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns = [0])
# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")
Notice that the shape of the data doesn't change, but that instead of ranging from 0 to 8ish, it now ranges from 0 to 1.
# read in all our data
countries = pd.read_csv(r"C:\Users\byamu\Dropbox\Tony Byamungu-28775\Data Analytics and Exploration\Task 14\countries.csv")
countries.head()
# scale the 'IT.CEL.SETS.P2' column from 0 to 1
scaled_data = minmax_scaling(countries['IT.CEL.SETS.P2'], columns = [0])
# plot the original & scaled data together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(countries['IT.CEL.SETS.P2'], ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")
You can see that scaling changed the range of the plot dramatically (but not the shape of the data: most values are small, while a few are very large).
Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described as a normal distribution.
Normal distribution: also known as the "bell curve", this is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean. The normal distribution is also known as the Gaussian distribution.
In general, you'll only want to normalize your data if you're going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples of these include t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with "Gaussian" in the name probably assumes normality.)
# normalize the 'NE.EXP.GNFS.ZS' column with a Box-Cox transformation
# (Box-Cox requires strictly positive values)
positive_values = countries['NE.EXP.GNFS.ZS']
normalized_values = stats.boxcox(positive_values)[0]
# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(positive_values, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_values, ax=ax[1])
ax[1].set_title("Normalized data")
# Read in Pakistan.csv and display the first 5 rows
pakistan = pd.read_csv(r"C:\Users\byamu\Dropbox\Tony Byamungu-28775\Data Analytics and Exploration\Task 14\Pakistan.csv", encoding='latin')
pakistan.head()
# get the number of missing data points per column
total_missing = pakistan.isnull().sum()
# look at the # of missing points in the first ten columns
total_missing[0:10]
Write a note on the reason you think we have missing data in the top three columns: Islamic date, Holiday type and Time. Remember to classify them according to the three categories we analysed in the PDF.
Time: I think the values are missing in this column because there was no way to accurately record the time. Perhaps the person collecting the data couldn't find a trusted source to record it accurately, and there is no reliable way to impute or replace the missing values, so this is Missing Not at Random (MNAR).
Holiday type: I think we have missing values in this column because not every day is a holiday, so it makes sense that some rows have a missing value here simply because that day was not a holiday. The missingness has nothing to do with the hypothetical value itself or with the values of the other data, so this is Missing Completely at Random (MCAR).
Islamic date: The data are Missing at Random (MAR), because the propensity for a data point to be missing is not related to the missing value itself, but it is related to some of the observed data.
For each of the following examples, decide whether scaling or normalization makes more sense.
Here, because we are going to use a machine learning algorithm to build a linear regression model, and the time students spend studying is not normally distributed, normalization is the way to go.
Here, because we just need to change the range and put the former and the latter observations on the same scale, we need to use scaling.
# From the countries dataset scale the "EG.ELC.ACCS.ZS" column and visualize it as well
eg = countries['EG.ELC.ACCS.ZS']
eg_scaled = minmax_scaling(eg, columns=[0])
plt.subplot(1,2,1)
plt.hist(eg)
plt.subplot(1,2,2)
plt.hist(eg_scaled)
# From the countries dataset normalize the "SP.DYN.CBRT.IN" column
sp = countries['SP.DYN.CBRT.IN']
sp_normalized = stats.boxcox(sp)[0]
plt.figure(figsize=(8,8))
plt.subplot(121)
plt.hist(sp)
plt.subplot(122)
plt.hist(sp_normalized)